Prosper Loan Data Analysis by Libin Guo

Prosper, or Prosper Marketplace, is a leader in the online peer-to-peer lending industry. Borrowers create profiles and listings (request loans), investors either individuals or institutions, view the listing (borrower’s loan request) and decide how much to lend the borrower towards the loan.
This Prosper Loan data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
There are 3 stakeholder in this business model – borrower, investor and the company Prosper. From the business perspective, satisfying need of target customer is the most important thing . This project will focuse in customer analysis and explore profit-driven loan for borrower and lender and conduct time study to see loans changes over the year. Firstly, it will explore the classification of target customer - borrower and lender, respectively. Secondly, for borrower, the analysis will explore which group of borrower has higher chance to have the changed off debt. Thirdly,the analysis is trying to figure out what kind of investment(combined factors) is the most popular for investor.

Univariate Plots Section

In the begining of the project, I want to use univariate plots to explore the overall statistic for Prosper loans and its customer distribution. At first, let’s check the overall distribution of Prosper ratings listing to understand more of loan characteristics.

nrow(loan) # check number of rows 
## [1] 113937
ncol(loan) #check number of columns
## [1] 81
names(loan) #Check the name of the dataset
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Loan Prosper Rating is the fisrt factor I want to check

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

From Loan Prosper Rating histogram, we can see there is near 30,000 unrated listing. Also, Rating C loans has highest number among all the listing, which is close to 18,000 listings.B, C rating Loan are top 2 and to 3 popular ratings. Also,from Listing Category Frequency Histogram, we can also learn that Debt Consolidation is Top1 category of all the listing category and has 58308, which is about 50% of total number of loans.Let’s also plot a time study histogry to see the number of loans.

## Q1 2006 Q1 2007 Q1 2008 Q1 2010 Q1 2011 Q1 2012 Q1 2013 Q1 2014 Q2 2006 
##     315    3079    3074    1243    1744    4435    3616   12172    1254 
## Q2 2007 Q2 2008 Q2 2009 Q2 2010 Q2 2011 Q2 2012 Q2 2013 Q3 2006 Q3 2007 
##    3118    4344      13    1539    2478    5061    7099    1934    2671 
## Q3 2008 Q3 2009 Q3 2010 Q3 2011 Q3 2012 Q3 2013 Q4 2005 Q4 2006 Q4 2007 
##    3602     585    1270    3093    5632    9180      22    2403    2592 
## Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Q4 2013 
##     532    1449    1600    3913    4425   14450

Year variable needed to be extracted and transfomed from quater variable at first and plot histogram using both time varaibles. From these two graphs, we can see the number of total Prosper loan has ups and downs trend from 2006 to 2014. It went up from 2006 to 2008 and dropped heavily in 2009, which may be caused by financial crisis in 2008, then it increased again until 2014 Q1, 2014 only has 1 quarter data, which can not represent whole year. Overall, the total number of loan are increasing over the years.

Now I have seen some loan information, I want to know the cluster customer by Customer Risk Score to its distribution.

From the chart, we can conclude the customers’ risk score is normally distributed and about 70% customers’ risk score are between 3.5 and 8.5.From Borrow State Frequency chart, we can see CA- California has the highest number of loans, which is more than twice than the secondest highest state -Texas. This makes sense as Prosper’s was founded in California and it should has the largest base customer. After exploring borrower data, let’s check investor related variable - number of investor per loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

The max number of investors per loan is 1189, while density graph shows that 95% of loan’s number of investor are in the range of (0,290). The median number of investors are 44. This variable should be investigated later to see which loan attracts investors the most. Another factors that is important for investor is return rate, let’s check the return rate distribution using bar chart. Since EstimatedReturn has a lot of missing data, I need to remove them at first.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

The avereage return rate is 9.6% overall, which is about the center of this frequency ploygen charts. There is one outler about return rate 13%, which I want to explore later. Also, Loan Amount distribution will be a interesting metric. At first, i see a very long tailed histogram and I decided to use 90% quantile to see the majority of loan Amount lies.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Obviouslly, 90% of loan amount are between 1000 and 15000, and the max amount is $35,000, the median amount is $6500. Now we know most people has median risk score, the 3rd question I will be interested is how long will people usually choose for loan? Let’s answer this question with a histogram. Since the terms is months unit, I will tranform it into year)

We can see from the result that about 3 years terms is most popular option to pay the loan.So far, we got a general idea of the prosper loan product, let’s check the overall statistic for Loan status to see how well people pay their loan.

From this fipped histogram, we can see about 33% of loan are completed, almost 50% loan are still in program, while about 10% loan are changed off. a low charged off loan rate is a indicator of good loan product. To help prosper succeed, I want to figure out which factors lead to higher chance of loan being charged off. I will use scatter plot to explore the coefficient between variable and loan. Since we explore terms of years and loan status in Univariate Analysis, I want to limit the plot to loan status ‘Completed’ and ‘Chargedoff’ by terms to see how differnt terms of loan varies for these two categories.

From above two bar talbe, we can see most terms in completed status are 3 years terms, in non_completed status, there is higher percentage of 5 years terms, this makes sense to the data.

Univariate Analysis

What is the structure of your dataset?

This dataset contains 81 variable, these variable can be generally divided as 4 groups - loan/Listing variables, borrower variable, LP-Prosper Payment related variable, and time variabls. This data contains continuous, binary, nominal and discrete data. I will use different plot for different data type.

What is/are the main feature(s) of interest in your dataset?

I am interesting in which customer group have a higher probablity of paying loan on time and which customer groups have a higher chance of loans being charged off.Secondly, I am also curious to see which type of loan factor affects investor’s the most, so that Prosper can benefit and make strategy of selling different loans.lastly, I also want to investigate the trend of customer/Profit change over the year to validate the past business strategy is working.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think borrower geographical and demographical information will be very helpful for investigating the customer group. Also, EstimatedReturn rate may be a good measurement for a good lender and I can explore the correlation between EstimatedReturn rate and Furthermore, columns with prefix ‘LP_’ can be used to calculate the profit Prosper gain from each listing.

Did you create any new variables from existing variables in the dataset?

Yes, I created year varibale form LoanOriginationQuarter and percentage variable by Customer risk score and by loan status

Of the features you investigated, were there any unusual distributions?

Yes, I transformed term into year unit to have a better understanding of length of loan terms since terms in months is very scatterred. After transforming, it is clear to see 3 years’ term is the most popular option for borrower.

Bivariate Plots Section

Additionally,let’s check the average borrower LoanOriginalAmount per loan status to see if high risk borrower have higher percentage of charged off loans. , there are a lot N/A value among completed, current and charged off listing status. Therefore, before computing the average value, N/A value is removed.

## # A tibble: 12 x 3
##    LoanStatus             Avg_ProsperScore     n
##    <fct>                             <dbl> <int>
##  1 Past Due (>120 days)               4.75    16
##  2 Past Due (16-30 days)              4.97   265
##  3 Past Due (61-90 days)              5.01   313
##  4 Past Due (1-15 days)               5.04   806
##  5 Past Due (31-60 days)              5.11   363
##  6 Past Due (91-120 days)             5.22   304
##  7 Chargedoff                         5.39 11992
##  8 Defaulted                          5.62  5018
##  9 FinalPaymentInProgress             5.75   205
## 10 Current                            5.84 56576
## 11 Completed                          6.54 38074
## 12 Cancelled                        NaN        5

From the bar chart, we can see Completed group of borrower has the highest average Prosper Score, which means they have lowest risk. apparantly, prosper score is a relaible indicator of good loan candidate.

The higher LoanOriginalAmount means higher profit contribution of borrower, which can help Prosper make suitable marketing plan. On the other hand, years of employment can be a good meansurement to find the target custmor

## # A tibble: 6 x 4
##   experience avgAmt sumAmt     n
##        <dbl>  <dbl>  <int> <int>
## 1       45.7  25000  50000     2
## 2       47.0  25000  25000     1
## 3       46.2  18920  37840     2
## 4       36.8  17889 161000     9
## 5       41.7  15000  15000     1
## 6       44.1  15000  30000     2

The Scatter Plot of loan Amount vs Years of Employment Experience line shows no relationship between oan Amount and Years of Employment. Let’s try to use average LoanOriginalAmount by experience of employment at first From Average Loan Amount by Years of Employment Experience Line Chart, we can see the years of Employment Experience between 45 and 50 has the highest average LoanOriginalAmount, which is outlier, much more than other experience group. This group have high demand of loan amount per person which can be further explored.

## [1] 0.3389275

From Scatter Plot of Total Loan Amount vs Year, we cannot see proper rating system between 2006 to 2008 because the missing value form prosper rating column have been removed, which means from 2006 to 2008, there is no prosper rating systerm. Plus,the total loan amount is increasing from 2010 to 2012. large amount of 2013 loan and 2014’s loans have not been rated yet.

Also, from scatter plot of Loan Amount vs Term in year has positive correlation,their coefficient correlation is 0.3389, which is moderate level. Another question I have in mind is EstimatedReturn rate’s relationship with number of investors. Does higher return rate attract more investor? Let’s find out in scatter plot.

The estimated return and number of investor are not totally positively correlated. The range of estimated return between 5% and 17% attract the majority of investor, the rate between 4% and above 18% attract compararty small investor. This can be investigated in next section. Additionally, I also want to see the EstimatedReturn rate’s relationship with ProsperRating to see if they have strong correlation using boxplot. Do higher prosper rating have higher return rate?

From the boxplot, we can tell the highest risk rating HR result in second highest average return rate,while it has long range of return rate from -20% to 14%. Also, from AA -E, the average return rate goes up even the rating goes down. ProsperRating and return rate seem have weak negative correlation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in

the dataset? From Average Prosper Score vs Loan Status chart, we can see apparantly people who has completed loan carries highest prosper score among all category. This also proves the effectiveness and accurancy of prosper score.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

from the Employment Experience vs LoanOriginalAmount charts, we can conclude that group ‘between 45 and 50’ has the highest average LoanOriginalAmount, which is outlier. This is very interesting because it shows this age group people likes to borrow more debts from Prosper.

What was the strongest relationship you found?

Among, above analysis, LoanOriginalAmount and Term in year has positive correlation,their coefficient correlation is 0.3389, which is moderate level.

Multivariate Plots Section

As I stated in univariate analysis summary, geographical information can be a good part to segment borrower. In this section, 1. I will analyze borrowers’ data from geographical and demographical factor. 2. I will further explore EstimatedReturn and chargedoff loss for investor’ relationship with loan and borrower’s factor. 3. I will conduct a time study for overall proper loan and investor’s preference for loans.

1.1 ProsperScore vs Loan Status (completed and charged off) vs Borrower State.

This result is really intriguing, almost all the state’s completed loan borrower have higher prosper score except MT- Montana. Motana’s chargedoff loan borrower has about 6.8 Prosper score , very high compared to its completed loan borrower score 5.5. Montana is very unique case.

1.2 State vs LoanOriginalAmount vs IsBorrowerHomeowner

We can see from the graph that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrower loans much higher than other states in the chart.

2.1 I EstimatedReturn vs ProsperRating facet vs Employment status

The Estimated Return rate for different employment type of borrower does not vary much, but they do have different preference for different category of loan.

2.2 LP_GrossPrincipalLoss vs ProsperRating vs Income Range

The investment always have gains and loss. Let’s further investigate the factors that influence borrowers LP_GrossPrincipalLoss because this will affect the rating of loan and evaluate which loan will investor loss the most money on.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -94.2     0.0     0.0   700.4     0.0 25000.0

Surprisingly, Low risk loan have higher total amount of loss, this is probably due to low risk loan has more investment. Also, high income group ($75,000 -99,999) has highest percentage in HR High risk rating loan. apearantly, the higher the income,the more risk people are willing to take.

2.3 EstimatedReturn vs vs ProsperRating by Income Range

We can see that HR rating loan has the widest range of estimated return, from -20% to 18%. Rating A, AA low risk loan don’t have negative return rate.

3.1 Time study of Trend of customer and investor preference for loan.

Before ploting, I remove 2005 and 2014 year since they only have one quater data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Before 2009, Prosper did not have prosper rating system for their loan, therefore, I remove years before 2009 and 2014 and remove missing value in prosper rating.The total loan amount and number of investors is increasing from 2006 to 2012, which is good sign that Prosper’s business is growing. However, compared to 2012, 2013’s number is very low. In bivariate analysis, I plot year vs total loan amount, the 2013 number is higher than 2012, which may mean a large number of 2013’s loan have not been rated yet.

Let’s also check loan return rate vs vs Proper rating over years.

The graphy shows that from 2010 to 2013, the average estimated return rate is acutually decreasing for all rating of loans.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

1.From 1.2 charts of Median LoanOriginalAmount vs IsBorrowerHomeowner by state shows that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrower loans much higher than other states in the chart and top natinalwide.

  1. From 3 times study graphs, we can learn before 2009, Prosper did not have prosper rating system for their loan.The total loan amount and total number of investors is increasing from 2006 to 2013, which is a sign that Prosper’s business is growing.The 3rd time study graph shows that from 2010 to 2013, the average estimated return rate is acutually decreasing for all rating of loans.

Were there any interesting or surprising interactions between features?

In bar chart of Prosper Score by Loan Status -(Completed, charged off) by state, all the state’s completed loan borrower have higher average prosper score except MT- Montana. Motana’s chargedoff loan borrower has about 6.8 Prosper score, very high compared to its completed loan borrower score 5.5.

furthermore, from chart of GrossPrincipalLoss vs ProsperRating by Income Range, surprising, Low risk loan have higher total amount of loss, this is probably due to low risk loan has more investment. Also, high income group($75,000 -99,999) has highest percentage in HR High risk rating loan. apearantly, the higher the income,the more risk people are willing to take.


Final Plots and Summary

Plot One

Description One

The graph shows that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrowers’ average loans is about $11,000, which is much higher than other states nationalwide. Hawaii borrower has the highest percentage of not owning home among all the states.

Plot Two

Description Two

The graphs show that low risk loan have higher total amount of loss, compared to total loan amount per rating group. Also, high income group($75,000 -99,999) has highest percentage in HR High risk rating loan. Therefore, we can conclude the higher the income,the more risk people are willing to take. In addition, we can see that HR rating loan, which has the lowest total loan amount, has the widest range of estimated return, from -20% to 18%. Rating A, AA low risk loan don’t have negative return rate. Rating Loan D and E have the highest average return rate.

Plot Three

Description Three

Before 2009, Prosper did not have prosper rating system for their loan, therefore, I remove years before 2009 and 2014 and remove missing value in prosper rating.The total loan amount and number of investors is increasing from 2006 to 2012, which is good sign that Prosper’s business is growing. However, I noticed, compared to 2012, 2013’s number is very low. In bivariate analysis, I plot year vs total loan amount, the 2013 number is higher than 2012, which may mean a large number of 2013’s loan have not been rated yet since the latest quater is Q1 2014. Also, from these two charts, we can see prosper rating D and C loan are most popular products For investor and borrowers in Prosper in 2012, especially for borrower, these two loans are the most attractive one. While loan E’s percentage of investors’ number is decreasing from 2010 to 2012.


Reflection

The biggest struggle I faced in the begining of this project is to find the most interesting question to ask and select useful variables from these 81 variables.

In order to come up a good business question to ask, I did reseach on prosper.com to understand its business model,read data disctionary several times to understand the meaning of each columns and browse blog of this dataset online. Then I conduct summary() function to check the value of each value to understand the data structure better, and I believe data analysis is designed to help business answer question,and the 1st thing about business is customer-focused and product focused. therefore, I designed the project to perform customer analysis and explore profit-driven loan for borrower and lender and conduct time study to see loans changes over the year. 1. Explore the classification of target customer - borrower and lender, respectively. 2. For borrower, the analysis will explore which group of borrower has higher chance to have the changed off debt. 3. Identify what kind of investment(combined factors) is the most popular for investor.

I succesfully divided data variable into different groups -borrower, loan, investors,and LP payment, and created all kinds of charts using ggplot2. Firstly,I created plot to show Average Prosper Score by Loan Status by state, and noticed that the better risk score and the higher completion rate.I also find out the terms and Loan amount have moderate positive correlation and home ower will more likely to request higher amount loan. For loan product side, I discovered that product B and C are most popular loan which attracts the most of investors.

Furthermore, a surprisingly interesting pattern that the lower risk loan is, the higher risk of being charged off. Also, another surprising discovery is from the Employment Experience vs LoanOriginalAmount charts, which shows group ‘between 45 and 50’ has the highest average LoanOriginalAmount, which is outlier. This is very interesting because it shows this age group people likes to borrow more debts from Prosper.

For the future improvement of data anlaysis, I will consider to discover more variable with strong/weak coefficient correlation between return rate of loan, and build a linear/logistic regression model to predict return rate for more loan product.Statistical tests, for example ttest,could be used to further explore this dataset.For instanct, ttest can be conducted using two samples from the same rating loan from 2009 and 2012 to infer the whole population to determine whether the average loan amount for loans in 2009 and 2012 are different or not.